In this project, I explore and analyze the white wine quality dataset. The dataset contains 4898 white wines with 11 variables quantifying different chemical attributes. It also includes an output variable, the quality rating of each wine on a scale from 0 to 10. I will analyze the relationships between the wine attributes and the ratings, and explore whether there are any strong relationships among the attributes themselves.
In this section, I load the data. The variable names are shown below.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Now let’s see the structure of the variables:
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
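The loading step itself does not appear in the output above. A minimal sketch of what it might look like follows. The real file name is not given in this report, so the sketch writes a temporary stand-in CSV built from the first five rows printed by `str()` above (values as rounded there) and reads that back.

```r
# Stand-in for the real CSV: the first five rows as printed by str() above.
# The actual file name and path are not given in this report.
first_rows <- data.frame(
  X = 1:5,
  fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23),
  citric.acid = c(0.36, 0.34, 0.4, 0.32, 0.32),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides = c(0.045, 0.049, 0.05, 0.058, 0.058),
  free.sulfur.dioxide = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density = c(1.001, 0.994, 0.995, 0.996, 0.996),
  pH = c(3, 3.3, 3.26, 3.19, 3.19),
  sulphates = c(0.45, 0.49, 0.44, 0.4, 0.4),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = rep(6L, 5)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(first_rows, csv_path, row.names = FALSE)

# Load the data, then list the variable names and the structure.
wine <- read.csv(csv_path)
names(wine)
str(wine)
```

With the real file, only the `read.csv()` call would change; `names()` and `str()` produce the two listings shown above.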
There is an X variable, which is just the index of each wine. Since there are no missing values in this dataset, I simply show the summary of each variable other than X below.
## fixed.acidity         volatile.acidity      citric.acid           residual.sugar
## Min.   : 3.800        Min.   :0.0800        Min.   :0.0000        Min.   : 0.600
## 1st Qu.: 6.300        1st Qu.:0.2100        1st Qu.:0.2700        1st Qu.: 1.700
## Median : 6.800        Median :0.2600        Median :0.3200        Median : 5.200
## Mean   : 6.855        Mean   :0.2782        Mean   :0.3342        Mean   : 6.391
## 3rd Qu.: 7.300        3rd Qu.:0.3200        3rd Qu.:0.3900        3rd Qu.: 9.900
## Max.   :14.200        Max.   :1.1000        Max.   :1.6600        Max.   :65.800
## chlorides             free.sulfur.dioxide   total.sulfur.dioxide
## Min.   :0.00900       Min.   :  2.00        Min.   :  9.0
## 1st Qu.:0.03600       1st Qu.: 23.00        1st Qu.:108.0
## Median :0.04300       Median : 34.00        Median :134.0
## Mean   :0.04577       Mean   : 35.31        Mean   :138.4
## 3rd Qu.:0.05000       3rd Qu.: 46.00        3rd Qu.:167.0
## Max.   :0.34600       Max.   :289.00        Max.   :440.0
## density               pH                    sulphates             alcohol
## Min.   :0.9871        Min.   :2.720         Min.   :0.2200        Min.   : 8.00
## 1st Qu.:0.9917        1st Qu.:3.090         1st Qu.:0.4100        1st Qu.: 9.50
## Median :0.9937        Median :3.180         Median :0.4700        Median :10.40
## Mean   :0.9940        Mean   :3.188         Mean   :0.4898        Mean   :10.51
## 3rd Qu.:0.9961        3rd Qu.:3.280         3rd Qu.:0.5500        3rd Qu.:11.40
## Max.   :1.0390        Max.   :3.820         Max.   :1.0800        Max.   :14.20
## quality
## Min.   :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean   :5.878
## 3rd Qu.:6.000
## Max.   :9.000
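The statement that nothing is missing can be checked directly, and the index column can be dropped before summarising. The sketch below assumes a data frame called `wine`; here it is a small simulated stand-in with only a few columns, not the real data.

```r
# Simulated stand-in for the wine data (NOT the real values), with just
# enough columns to show the checks.
set.seed(1)
wine <- data.frame(
  X = 1:20,
  pH = round(rnorm(20, mean = 3.2, sd = 0.15), 2),
  alcohol = round(runif(20, min = 8, max = 14), 1),
  quality = sample(3:9, 20, replace = TRUE)
)

# Count missing values per column; all zeros means nothing is missing.
missing_per_column <- colSums(is.na(wine))
missing_per_column

# Drop the index column X before summarising, since it carries no information.
wine_no_index <- subset(wine, select = -X)
summary(wine_no_index)
```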
In this section, I plot several histograms to explore how the wines are distributed across different variables.
First let’s take a look at the ratings of the wines.
The ratings are roughly normally distributed with a center at 6, and most wines received a rating of 5 or 6.
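The counts behind a plot like this can also be tabulated directly. The sketch below uses simulated ratings with arbitrary weights, not the real `quality` column, and base R graphics to stay dependency-free.

```r
# Simulated ratings (NOT the real data); the weights are arbitrary and only
# make the middle ratings more common.
set.seed(2)
quality <- sample(3:9, 200, replace = TRUE,
                  prob = c(0.01, 0.03, 0.30, 0.45, 0.17, 0.035, 0.005))

# Count of wines per rating, and the share of wines rated 5 or 6.
rating_counts <- table(quality)
rating_counts
share_5_or_6 <- mean(quality %in% c(5, 6))
share_5_or_6

# Bar chart of the counts; ratings are discrete, so one bar per rating
# avoids any dependence on histogram bin widths.
barplot(rating_counts, xlab = "quality", ylab = "count")
```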
Next is alcohol. The counts decrease as the alcohol percentage increases. Wines with about 9% alcohol are the most common, and the distribution is right-skewed: its tail extends toward higher values, which agrees with the mean (10.51) being above the median (10.40).
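The direction of skew can be checked numerically rather than by eye. In the summary above, the mean of alcohol (10.51) is above its median (10.40), which points to a longer tail on the right. The sketch below defines a simple moment-based skewness helper (my own, not from a package) and applies it to simulated values rather than the real column.

```r
# Simulated right-skewed values (NOT the real alcohol column).
set.seed(3)
alcohol <- 8 + rgamma(500, shape = 2, rate = 0.8)

# Moment-based sample skewness: the mean of the cubed z-scores.
# Positive means a longer right tail, negative a longer left tail.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skewness(alcohol)

# For right-skewed data the mean usually sits above the median.
mean(alcohol) > median(alcohol)
```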
Next is fixed acidity. Most wines have a fixed acidity between 6 and 8 g/dm^3.
The above histogram shows the counts for total sulfur dioxide. Most wines have total sulfur dioxide between 100 and 200 mg/dm^3.
This histogram shows the counts of wines at different pH values. Most wines have a pH between about 3.0 and 3.3.
This histogram shows the counts of wines by residual sugar. The counts peak below about 2.5 g/dm^3, although the summary above gives a median of 5.2 g/dm^3, so the distribution has a long right tail.
Last, let’s plot the histograms for every variable in a single figure.
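One way to draw all histograms in a single figure is to reshape the data to long format and facet by variable. This is a sketch assuming ggplot2 is installed; the data frame is a simulated stand-in with three columns, not the real data.

```r
library(ggplot2)

# Simulated stand-in (NOT the real data) with three numeric columns.
set.seed(4)
wine <- data.frame(
  pH = rnorm(100, mean = 3.2, sd = 0.15),
  alcohol = runif(100, min = 8, max = 14),
  residual.sugar = rexp(100, rate = 1 / 6)
)

# Reshape to long format: one row per (variable, value) pair.
# stack() returns the columns `values` and `ind` (the source column name).
wine_long <- stack(wine)

# One histogram panel per variable, each with its own x and y scale.
p <- ggplot(wine_long, aes(x = values)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ ind, scales = "free")
print(p)
```

Free scales matter here because the variables are measured in very different units and ranges.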
There are 4898 observations and 13 variables in this dataset. Among the variables, X is the index of the wines and quality is the rating of each wine; both are integers. Quality is treated as depending on the other 11 variables, which are properties of the wines and are stored as numeric (floating-point) values.
In this dataset, I’m most interested in the relationships between pH, alcohol and quality. I would like to explore whether there is any strong relationship between them.
Density, volatile acidity and free sulfur dioxide may also support my investigation.
I haven’t created any new variables so far, since I’m not familiar with all the chemicals. For the different chemicals, it is unclear what counts as a high or a low value.
Some variables are right-skewed and some are approximately normally distributed. Apart from that, there are no unusual distributions in the dataset.
In this part, let’s look at some bivariate plots and follow up on the variables I’m interested in. First, let’s look at a box plot of alcohol by wine quality.
For most wines with a quality rating above 6, the alcohol percentage is above 10%.
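A box plot of alcohol by rating, together with the per-rating medians, might be produced along these lines. This is a sketch assuming ggplot2 is installed; the data are simulated so that alcohol drifts upward with quality by construction, and they are not the real values.

```r
library(ggplot2)

# Simulated stand-in (NOT the real data): alcohol rises with quality by
# construction, only so that the plot has something to show.
set.seed(5)
quality <- sample(4:8, 300, replace = TRUE)
wine <- data.frame(
  quality = quality,
  alcohol = 8 + 0.5 * quality + rnorm(300, sd = 1)
)

# Median alcohol per rating, i.e. the line drawn inside each box.
median_by_quality <- tapply(wine$alcohol, wine$quality, median)
median_by_quality

# quality is converted to a factor so that each rating gets its own box.
p <- ggplot(wine, aes(x = factor(quality), y = alcohol)) +
  geom_boxplot() +
  labs(x = "quality", y = "alcohol (%)")
print(p)
```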
In the plot above, pH is roughly normally distributed within each quality rating, and there is no strong relationship between pH and quality.
The above graph is the scatter plot of pH vs. alcohol. It shows no strong relationship between pH and alcohol.
The above graph is the scatter plot of residual.sugar vs. pH. It shows no strong relationship between residual.sugar and pH.
Above is the scatter plot of volatile.acidity vs. pH. My assumption was that volatile acidity would affect pH, but the scatter plot shows no strong relationship between them.
The above plot is total.sulfur.dioxide vs. density. The density of the wine increases with total sulfur dioxide.
Let’s take a look at alcohol vs. density. As alcohol increases, the density of the wine drops.
The plot of pH vs. density shows no strong relationship between pH and density.
From the investigation above, there is no strong relationship for pH vs. alcohol, residual.sugar vs. pH, volatile.acidity vs. pH, or pH vs. density. There is a clear negative linear relationship for alcohol vs. density and a weaker positive one for total.sulfur.dioxide vs. density.
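Judgements such as "strong" or "no relationship" can be backed by correlation coefficients. The sketch below computes a Pearson correlation matrix on a simulated stand-in in which density is built to fall with alcohol and pH is independent noise; the numbers it prints say nothing about the real data.

```r
# Simulated stand-in (NOT the real data). density is constructed to fall
# with alcohol and pH is independent noise, purely for illustration.
set.seed(6)
alcohol <- runif(200, min = 8, max = 14)
wine <- data.frame(
  alcohol = alcohol,
  density = 1.005 - 0.001 * alcohol + rnorm(200, sd = 0.001),
  pH = rnorm(200, mean = 3.2, sd = 0.15)
)

# Pearson correlation matrix: values near +1 or -1 indicate a strong
# linear relationship, values near 0 indicate little or none.
cor_matrix <- cor(wine)
round(cor_matrix, 2)

# A single pair, with a significance test and confidence interval.
cor.test(wine$alcohol, wine$density)
```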
Yes. At the beginning I didn’t have much interest in density, but alcohol and total.sulfur.dioxide both turn out to be related to it.
The strongest relationship is between alcohol and density. With more alcohol in a wine, the density drops, which is plausible because ethanol is less dense than water.
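A simple linear regression is one way to quantify how density changes with alcohol. The sketch below fits `lm(density ~ alcohol)` on simulated values with an arbitrary built-in slope, so the coefficients it prints are not estimates for the real data.

```r
# Simulated stand-in (NOT the real data); the slope of -0.001 is arbitrary.
set.seed(7)
alcohol <- runif(200, min = 8, max = 14)
wine <- data.frame(
  alcohol = alcohol,
  density = 1.005 - 0.001 * alcohol + rnorm(200, sd = 0.001)
)

# Simple linear regression of density on alcohol.
fit <- lm(density ~ alcohol, data = wine)

# A negative slope means density falls as alcohol rises.
slope <- coef(fit)[["alcohol"]]
slope

# Share of the variance in density explained by alcohol.
r_squared <- summary(fit)$r.squared
r_squared
```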
In this section, I have plotted several scatter plots with quality as a factor.
The first graph above shows the relationship between pH and alcohol. It is hard to see anything useful related to the quality of the wines.
The second graph above is the scatter plot of volatile.acidity vs. pH. Most wines with a quality rating above 5 have a volatile acidity around or under 0.25 g/dm^3 and are spread across different pH values.
The above graph is the scatter plot of alcohol vs. density. Wines rated higher than 5 are concentrated at alcohol above 11% and density below 0.996 g/cm^3.
The above graph is the scatter plot of pH vs. density. Most wines rated higher than 5 are spread across different pH values, with density under 0.996 g/cm^3.
The last graph here is the scatter plot of total sulfur dioxide vs. density. Most wines rated higher than 5 have total sulfur dioxide under 200 mg/dm^3. There is also a weak linear relationship between total sulfur dioxide and density: as total sulfur dioxide increases, the density increases.
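Scatter plots with quality as a factor, like the ones described in this section, might be drawn along these lines. This is a sketch assuming ggplot2 is installed, using simulated values rather than the real data; the column `quality_f` is a name introduced here for illustration.

```r
library(ggplot2)

# Simulated stand-in (NOT the real data).
set.seed(8)
n <- 300
alcohol <- runif(n, min = 8, max = 14)
wine <- data.frame(
  alcohol = alcohol,
  density = 1.005 - 0.001 * alcohol + rnorm(n, sd = 0.001),
  quality = sample(4:8, n, replace = TRUE)
)

# Convert quality to a factor so ggplot assigns one discrete colour per
# rating instead of a continuous gradient.
wine$quality_f <- factor(wine$quality)

p <- ggplot(wine, aes(x = alcohol, y = density, colour = quality_f)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +  # one fitted line per rating
  labs(colour = "quality")
print(p)
```

Because colour is mapped to a factor, `geom_smooth()` fits a separate line for each rating, which makes it easier to compare groups than a single overall line.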
From the analysis above, density is actually an important property for rating wines. A higher alcohol percentage is also associated with a higher wine rating. In contrast, at the beginning I was interested in the effect of pH on wine quality, but it turns out not to matter much.
Density appears to be strongly related to wine quality, which is very surprising to me.
I chose this plot because it shows the distribution of wines across quality ratings. Wines rated 6 are the most common, and the ratings are roughly normally distributed. The rating is based on all the other properties of the wines and gives an idea of what most wines look like.
Since I’m interested in how alcohol affects the ratings, I chose this box plot to analyze the relationship. For wines with a quality rating above 5, both the mean and the median alcohol percentage are above 10%. Wines with higher alcohol tend to have higher quality ratings.
In the previous sections I found that density can also be an important property of wines, so here I chose this scatter plot to show the relationship between alcohol and density and how they relate to wine quality. The plot shows that density decreases as the alcohol percentage increases. Following the regression lines, wines with higher quality tend to have a high alcohol percentage and, consistent with that trend, a low density. For wines with lower quality it is the opposite.
From the last two graphs above, wines with a higher alcohol percentage tend to have a higher rating. This is probably related to the fermentation process in winemaking, in which yeast produces the alcohol. The biggest struggle in this project was that I’m still not very familiar with ggplot and R. I have some better ideas but don’t yet know how to implement them, which leaves me plenty of room to keep studying. What surprised me was the dataset provided by the instructors. This project was a lot of fun and I really enjoyed it. Once I’m more familiar with R programming, I will come back and explore this dataset further.